Scientific Progress


This project addresses the fundamental problem of how to tell whether a system or a program is behaving properly and has not been
compromised by stealthy malware. During the course of the project (Apr. 2011 - Dec. 2011), the PI and her two students (Kui
Xu and Karim Elish) performed two unique studies (user-centric dependence and quantified dependence analysis), both
related to designing novel dependence-based anomaly detection solutions that aim at enforcing dependence properties of
legitimate programs, operations, and systems. Anomaly detection has never been systematically studied as a system security
approach due to two main technical challenges:

i)  the  (normal)  behaviors  of  legitimate  programs  and  systems  are  diverse  and  difficult  to  define,  and 

ii) unlike numerical attributes, program and system properties cannot be analyzed with statistical methods; thus, there is
no general enforcement methodology for normal system-security patterns.

Our  anomaly  detection  approach  is  to  focus  on  enforcing  the  proper  data  and  control  dependences  in  program  execution  and  to 
identify  any  violations  of  the  dependences.  Such  an  approach  yields  long-lasting  and  powerful  malware-classification  solutions, 
because  it  is  not  limited  by  the  constantly  evolving  behaviors  of  malware. 

In user-centric data dependence, we completed stepping-stone studies demonstrating the feasibility of simple rule-based,
dependence-based anomaly detection in solving popular problems such as detecting drive-by downloads and classifying
Android apps. These studies provided a solid starting point for our future investigations.

We define the property of user-centric data dependence in a system or program as follows: system events or function calls need
to be directly or indirectly in response to user actions, commands, or inputs. Despite the simplicity of this definition and the
intuitive assumption on patterns of user-system interaction, we found that such a data-dependence specification is sufficient for
many anomaly detection needs in practice. We have two demonstrations of its usefulness in security applications.

1. DBD Detection (i.e., detecting malware-triggered file download)

We first demonstrate a concrete security application of enforcing user-centric data dependence in the context of file system
access and drive-by download (DBD) detection on a host. Our work appeared in the Proceedings of the Network and System
Security Conference (NSS '11) [Xu 2011], and the journal version is under review at ACM TISSEC. We collected file-system
events at the system call level and user-input events to a browser through keyboard and mouse hooks (in Windows), and used
a rule-based decision tree to classify whether a file creation request should be allowed. Our prototype is
browser-independent and can accurately identify benign browser-generated temporary file creations with a low false positive
rate. We spent significant effort refining and evaluating our prototype, including user studies (with 21 participants) for
analyzing proper threshold values, detecting 6 reproduced DBD exploits, detecting 84 websites containing live DBD
exploits, and automatically evaluating the top 2000 (legitimate) websites ranked by Alexa.com (no false alarm found).
Select results are shown below in Figure 1 (see attachment).

2.  Classification  of  Apps 

The previous DBD study treated the program (i.e., the browser) as a black box. In this study, we perform white-box program
analysis to enforce dependence in data flow. We focus on function calls that access critical system resources (such as network
I/O, file I/O, and the audio interface), and inspect the dependence of their arguments on any user inputs taken by the program.
Our hypothesis is that requests to system resources in legitimate programs are typically triggered by user inputs and actions,
whereas requests made by malware that abuses the system are not. We have developed automatic tools based on Soot (a static
analysis toolkit for Java) for obtaining context-sensitive data-dependence graphs. We found that in all legitimate programs,
all such function calls depended on user inputs, i.e., the user needs to enter certain information before the call is made. In
most of the malicious Android apps (3 out of 4), this property of data dependence is not observed; the malicious apps abuse the
system resources without the user's authorization, confirming our hypothesis on the differences in user-centric data-dependence
behaviors of legitimate and malicious programs. The last malware tested (Fakeneflic) is a phishing app that tricks the user into
entering their Netflix login. Detecting it is out of our scope and requires site authentication (i.e., certificate verification)
and user education. The preliminary results are shown in Table 1 (see attachment). Our work appeared in the IEEE MoST Workshop
in 2012 [Elish 2012]. We are currently performing more evaluation, and plan to submit the full version of this work to the IEEE
Security & Privacy Symposium 2013.

Summary  of  the  most  important  results: 

1.  We  demonstrated  the  feasibility  of  user-intention  based  dependency  analysis  as  a  general  and  powerful  methodology  for 
anomaly  detection  and  system  assurance. 

2. We produced practical tools that can be readily used, including one for detecting DBD attacks [Xu 2011], one for classifying
apps written in Java [Elish 2012], and one for identifying anomalous traffic [Zhang 2012].

3. Our other work includes a feasibility study on DNS-based botnet C&C [Butler 2011], cryptographic provenance verification for
system assurance [Xu 2012], and process identification [Almohri 2012]. Please find details of these studies in the enclosed
journal/conference versions submitted.
Overview

This project addresses the fundamental problem of how to tell whether a system or a program is behaving
properly and has not been compromised by stealthy malware. During the course of the project (Apr. 2011 - Dec.
2011), the PI and her students performed studies related to designing novel dependence-based anomaly detection
solutions that aim at enforcing dependence properties of legitimate programs, operations, and systems. Anomaly
detection has never been systematically studied as a system security approach due to two main technical
challenges:

i)  the  (normal)  behaviors  of  legitimate  programs  and  systems  are  diverse  and  difficult  to  define,  and 

ii) unlike numerical attributes, program and system properties cannot be analyzed with statistical methods;
thus, there is no general enforcement methodology for normal system-security patterns.

Our  anomaly  detection  approach  is  to  focus  on  enforcing  the  proper  data  and  control  dependences  in 
program  execution  and  to  identify  any  violations  of  the  dependences.  Such  an  approach  yields  long-lasting  and 
powerful  malware-classification  solutions,  because  it  is  not  limited  by  the  constantly  evolving  behaviors  of 
malware. For user-centric data dependence, we completed stepping-stone studies demonstrating the feasibility
of simple rule-based, dependence-based anomaly detection in solving popular problems such as detecting
drive-by downloads and classifying Android apps.

Report  on  User-Centric  Dependence 

We define the property of user-centric data dependence in a system or program as follows: system events
or function calls need to be directly or indirectly in response to user actions, commands, or inputs. Despite the
simplicity of this definition and the intuitive assumption on patterns of user-system interaction, we found that such a
data-dependence specification is sufficient for many anomaly detection needs in practice. We highlight some
demonstrations of its usefulness in security applications.

1. DBD Detection (i.e., detecting malware-triggered file download)

We first demonstrate a concrete security application of enforcing user-centric data dependence in the
context of file system access and drive-by download (DBD) detection on a host. Our work appeared in the
Proceedings of the Network and System Security Conference (NSS '11) [Xu 2011], and the journal version is under
review at Computers & Security.

We collected file-system events at the system call level and user-input events to a browser through
keyboard and mouse hooks (in Windows), and used a rule-based decision tree to classify whether a file
creation request should be allowed. Our prototype is browser-independent and can accurately
identify benign browser-generated temporary file creations with a low false positive rate. We spent significant
effort refining and evaluating our prototype, including user studies (with 21 participants) for analyzing proper
threshold values, detecting 6 reproduced DBD exploits, detecting 84 websites containing live DBD exploits,
and automatically evaluating the top 2000 (legitimate) websites ranked by Alexa.com (no false alarm found).
Select results are shown below in Figure 1.

Figure 1. (a) Comparison of the false positive rates in the temporal-only dependence analysis (blue) with user
data and the semantic-based dependence analysis (red), where hyperlinks associated with mouse-click events are
also used in defining security rules. (b) The work flow of our prototype DeWare (standing for Deletion of
Malware).
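
The two rule styles compared in Figure 1(a) can be illustrated with a minimal Python sketch. It is not the DeWare
implementation: the class names, the function is_user_triggered, and the 2-second default threshold are all hypothetical,
chosen only to show how a temporal rule (a file creation must closely follow a user input) can be tightened with a
semantic rule (the clicked hyperlink must match the download source).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class UserInput:
    """A keyboard or mouse event captured by an input hook (hypothetical record)."""
    timestamp: float                 # seconds
    link_url: Optional[str] = None   # hyperlink under a mouse click, if any


@dataclass(frozen=True)
class FileCreation:
    """A file-creation event observed at the system-call level (hypothetical record)."""
    timestamp: float
    path: str
    source_url: Optional[str] = None  # URL the file was fetched from, if known


def is_user_triggered(event: FileCreation,
                      inputs: Sequence[UserInput],
                      threshold: float = 2.0,       # assumed value, not from the report
                      require_link_match: bool = False) -> bool:
    """Return True if some user input precedes the event within `threshold` seconds.

    With require_link_match=True, the input must also carry a hyperlink equal to
    the event's source URL (the semantic variant of the rule).
    """
    for ui in inputs:
        delay = event.timestamp - ui.timestamp
        if 0.0 <= delay <= threshold:
            if not require_link_match:
                return True
            if ui.link_url is not None and ui.link_url == event.source_url:
                return True
    return False
```

A file creation for which is_user_triggered returns False would be a candidate DBD event under these assumed rules; in
practice the threshold would have to be calibrated, as the user studies described above did.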


2. Classification of Apps

The previous DBD study treated the program (i.e., the browser) as a black box. In this study, we perform
white-box program analysis to enforce dependence in data flow. We focus on function calls that access critical
system resources (such as network I/O, file I/O, and the audio interface), and inspect the dependence of their arguments
on any user inputs taken by the program. Our hypothesis is that requests to system resources in legitimate
programs are typically triggered by user inputs and actions, whereas requests made by malware that abuses the
system are not.

We have developed automatic tools based on Soot (a static analysis toolkit for Java) for obtaining
context-sensitive data-dependence graphs. We found that in all legitimate programs, all such function calls depended
on user inputs, i.e., the user needs to enter certain information before the call is made. In most of the
malicious Android apps (3 out of 4), this property of data dependence is not observed; the malicious apps abuse
the system resources without the user's authorization, confirming our hypothesis on the differences in user-centric
data-dependence behaviors of legitimate and malicious programs. The last malware tested (Fakeneflic) is a
phishing app that tricks the user into entering their Netflix login. Detecting it is out of our scope and requires site
authentication (i.e., certificate verification) and user education. The preliminary results are shown in Table 1.
Our work appeared in the IEEE MoST workshop [Elish 2012]. We are currently performing more evaluation and
formalization of the work, and plan to submit the full version of this work to the IEEE Security & Privacy Symposium
2013.


Program Name                        Num. of       % of Sensitive Func. Calls without   Types of Function Calls
                                    User Inputs   User Inputs/Sensitive Info *
----------------------------------  ------------  -----------------------------------  ------------------------
Legitimate
  URLConnectionReader               1             0%                                   console I/O, networking
  MailSender                        5             0%                                   javax.mail, console I/O
  UDPSendFileContent                1             0%                                   file I/O, networking
  Send SMS App                      2             0%                                   telephony.GSM
Malware
  EmailSpammer (proof of concept)   0             100%                                 javax.mail
  GGTracker.A (forwarding SMS)      0             100%                                 networking
  HippoSMS (sending SMS)            0             100%                                 telephony.GSM
  Android.Fakeneflic (Netflix)      2             0%                                   networking

* Number of sensitive function calls in these samples is one.

Table 1. Comparison of dependence properties in legitimate and malicious Android apps.
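
The percentage column in Table 1 can be understood as a reachability question on a data-dependence graph. The sketch
below is a toy Python illustration, not the Soot-based tool: the function names and the node labels in the usage note
are hypothetical, and the graph is assumed to be given as (definition, use) edges produced by some static analyzer.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def reachable_from(sources: Iterable[str],
                   edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return every node reachable from `sources` along def -> use edges (BFS)."""
    graph: Dict[str, Set[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    seen = set(sources)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def percent_without_user_input(sensitive_calls: Dict[str, List[str]],
                               user_inputs: Iterable[str],
                               edges: Iterable[Tuple[str, str]]) -> float:
    """Percentage of sensitive calls none of whose arguments depend on a user input.

    `sensitive_calls` maps a call site to the graph nodes of its arguments.
    """
    if not sensitive_calls:
        return 0.0
    dependent_nodes = reachable_from(user_inputs, edges)
    independent = [call for call, args in sensitive_calls.items()
                   if not any(arg in dependent_nodes for arg in args)]
    return 100.0 * len(independent) / len(sensitive_calls)
```

For example, with the hypothetical edges [("readLine", "url"), ("url", "openConnection.arg0")], the input source
"readLine", and the single sensitive call {"openConnection": ["openConnection.arg0"]}, the function returns 0.0. A
program that sends a hard-coded message with no user-input source would score 100.0.
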

3. Traffic Dependency Analysis for Network Security

In this work, we investigated an approach to enforce dependencies between network traffic and user activities for
anomaly detection. We presented a framework and algorithms that analyze user actions and network events on a
host according to their dependencies. Discovering these relations is useful in identifying anomalous events on a
host that are caused by software flaws or malicious code. To demonstrate the feasibility of user-intention-based
traffic dependence analysis, we implemented a prototype called CR-Miner and performed an extensive experimental
evaluation of the accuracy, security, and efficiency of our algorithm. The results show that our algorithm can
identify user-intention-based traffic dependence with high accuracy (average 99.6% for 20 users) and low false
alarms. Our prototype can successfully detect several pieces of HTTP-based real-world spyware. Our
dependence analysis is fast with a minimal storage requirement. We give a thorough analysis of the security and
robustness of the user-intention-based traffic dependence approach. This work appeared in the 2012 IEEE Workshop
on Semantics and Security (WSCS) [Zhang 2012]. The full version of the work, with expanded experiments and
modeling work, was submitted to ACM TISSEC.
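
The idea of tying each network event to a user action can be sketched in Python as follows. This report does not
describe CR-Miner's algorithm, so the sketch is only an illustration under assumed rules: a request is treated as
explained if it closely follows a user action, or if its referrer is a request that was already explained. The Request
record, find_unexplained, and the 1-second threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Request:
    """An outbound HTTP request seen on the host (hypothetical record)."""
    timestamp: float
    url: str
    referrer: Optional[str] = None


def find_unexplained(requests: Sequence[Request],
                     user_action_times: Sequence[float],
                     threshold: float = 1.0) -> List[Request]:
    """Return requests with no dependence on a user action, in time order.

    A request is explained if it occurs within `threshold` seconds after a user
    action (assumed rule), or if its referrer is an already explained request.
    """
    explained_urls = set()
    unexplained = []
    for req in sorted(requests, key=lambda r: r.timestamp):
        by_user = any(0.0 <= req.timestamp - t <= threshold
                      for t in user_action_times)
        by_parent = req.referrer is not None and req.referrer in explained_urls
        if by_user or by_parent:
            explained_urls.add(req.url)
        else:
            unexplained.append(req)
    return unexplained
```

Requests returned by find_unexplained have no chain of dependence leading back to the user and would be the ones
flagged for inspection under these assumptions.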

Summary  of  the  most  important  results: 

1.  We  demonstrated  the  feasibility  of  user-intention  based  dependency  analysis  as  a  general  and  powerful 
methodology  for  anomaly  detection  and  system  assurance. 

2. We produced practical tools that can be readily used, including one for detecting DBD attacks [Xu 2011], one
for classifying apps written in Java [Elish 2012], and one for identifying anomalous traffic [Zhang 2012].

3.  Our  other  work  includes  a  feasibility  study  on  DNS-based  botnet  C&C  [Butler  2011],  cryptographic  provenance 
verification  for  system  assurance  [Xu  2012],  and  process  identification  [Almohri  2012].  Please  find  details  of 
these  studies  in  the  enclosed  journal/conference  versions  submitted. 